
    Pregelix: Big(ger) Graph Analytics on A Dataflow Engine

    There is a growing need for distributed graph processing systems that are capable of gracefully scaling to very large graph datasets. Unfortunately, this challenge has not been easily met due to the intense memory pressure imposed by the process-centric, message-passing designs that many graph processing systems follow. Pregelix is a new open source distributed graph processing system based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open source systems (e.g., we have seen up to 15x speedup compared to Apache Giraph and up to 35x speedup compared to distributed GraphLab), and makes more effective use of available machine resources to support Big(ger) Graph Analytics.
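
    The abstract describes the vertex-centric, message-passing model that Pregelix evaluates as an iterative dataflow. As a rough illustration of that programming model (not Pregelix's actual API; the toy graph, function names, and superstep loop below are hypothetical), here is a minimal in-memory sketch of connected components computed in Pregel-style supersteps:

```python
# Minimal sketch of a Pregel-style superstep loop (hypothetical names, not the
# Pregelix API). Pregelix evaluates this same vertex-centric model as joins and
# group-bys in an iterative dataflow, which is what lets it spill to disk.

def connected_components(adj):
    """adj: dict mapping vertex id -> list of neighbour ids (undirected graph)."""
    label = {v: v for v in adj}      # every vertex starts in its own component
    changed = set(adj)               # vertices whose label changed last superstep

    while changed:                   # halt when no vertex changes
        # "send": each changed vertex sends its current label to its neighbours
        inbox = {}
        for v in changed:
            for u in adj[v]:
                inbox.setdefault(u, []).append(label[v])

        # "compute": each vertex adopts the smallest label it has seen
        changed = set()
        for v, msgs in inbox.items():
            smallest = min(msgs)
            if smallest < label[v]:
                label[v] = smallest
                changed.add(v)

    return label

g = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(connected_components(g))       # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```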

    Declarative Algorithms in Datalog with Extrema: Their Formal Semantics Simplified

    Recent advances are making possible the use of aggregates in recursive queries, thus enabling the declarative expression of classic algorithms and their efficient and scalable implementation. These advances rely on the notion of Pre-Mappability (PreM) of constraints that, along with the seminaive-fixpoint operational semantics, guarantees a formal non-monotonic semantics for recursive programs with min and max constraints. In this extended abstract, we introduce basic templates to simplify and automate the task of proving PreM.
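
    To make the PreM idea concrete, the following minimal Python sketch (an illustration, not code from the paper) runs a seminaive fixpoint for single-source shortest paths with the min constraint applied eagerly inside each iteration; PreM is the property that certifies this eager application as equivalent to applying min after the (possibly non-terminating) fixpoint over all path costs:

```python
# Hypothetical illustration: seminaive shortest-path fixpoint with the min
# constraint pushed into the recursion (the transformation whose soundness
# PreM is meant to guarantee). Edge facts are (source, target, cost) triples.

def shortest_paths(edges, start):
    best = {start: 0}        # path(X, D) facts, already min-reduced per vertex
    delta = {start: 0}       # newly derived facts (seminaive evaluation)

    while delta:
        new_delta = {}
        for (u, v, w) in edges:
            if u in delta:
                d = delta[u] + w
                # apply min eagerly: keep a new fact only if it improves on best
                if v not in best or d < best[v]:
                    best[v] = d
                    new_delta[v] = d
        delta = new_delta
    return best

edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5)]
print(shortest_paths(edges, "a"))    # {'a': 0, 'b': 1, 'c': 3}
```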

    Public Health for the Internet (φ): Towards a New Grand Challenge for Information Management

    Business incentives have brought us within a small factor of achieving the database community's Grand Challenge set out in the Asilomar Report of 1998. This paper makes the case for a new, focused Grand Challenge: Public Health for the Internet. The goal of PHI (or φ) is to enable collectives of hosts on the Internet to jointly monitor and promote network health by sharing information on network conditions in a peer-to-peer fashion. We argue that this will be a positive effort for the research community for a variety of reasons, both in terms of its technical reach and its societal impact. This version of the φ vision is targeted at readers in the database research community, but the effort is clearly multidisciplinary. A more generalist version of this paper will be maintained at http://openphi.net

    Iterative MapReduce for Large Scale Machine Learning

    Large datasets ("Big Data") are becoming ubiquitous because the potential value in deriving insights from data, across a wide range of business and scientific applications, is increasingly recognized. In particular, machine learning - one of the foundational disciplines for data analysis, summarization and inference - on Big Data has become routine at most organizations that operate large clouds, usually based on systems such as Hadoop that support the MapReduce programming paradigm. It is now widely recognized that while MapReduce is highly scalable, it suffers from a critical weakness for machine learning: it does not support iteration. Consequently, one has to program around this limitation, leading to fragile, inefficient code. Further, reliance on the programmer is inherently flawed in a multi-tenanted cloud environment, since the programmer does not have visibility into the state of the system when his or her program executes. Prior work has sought to address this problem by either developing specialized systems aimed at stylized applications, or by augmenting MapReduce with ad hoc support for saving state across iterations (driven by an external loop). In this paper, we advocate support for looping as a first-class construct, and propose an extension of the MapReduce programming paradigm called Iterative MapReduce. We then develop an optimizer for a class of Iterative MapReduce programs that cover most machine learning techniques, provide theoretical justifications for the key optimization steps, and empirically demonstrate that system-optimized programs for significant machine learning tasks are competitive with state-of-the-art specialized solutions.
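
    The "program around the limitation" pattern criticized above can be made concrete with a small sketch. The following Python code (an in-memory stand-in, not the proposed Iterative MapReduce API; all names are hypothetical) drives k-means by repeatedly issuing a map phase and a reduce phase from an external loop, with the convergence test living in the driver rather than in the execution framework:

```python
# Hypothetical stand-in for the "external loop around MapReduce" workaround: each
# k-means pass is one map + reduce job, and the iteration/convergence logic lives
# entirely in the driver program, invisible to the execution framework.
from collections import defaultdict

def map_assign(point, centroids):
    # emit (nearest centroid index, point) -- one map call per input record
    nearest = min(range(len(centroids)),
                  key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))
    return nearest, point

def reduce_recenter(points):
    # average the points assigned to one centroid -- one reduce call per key
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centroids, max_iters=20, tol=1e-6):
    for _ in range(max_iters):                      # the external driver loop
        groups = defaultdict(list)
        for p in points:                            # "map" phase
            key, value = map_assign(p, centroids)
            groups[key].append(value)
        new_centroids = list(centroids)
        for key, vals in groups.items():            # "reduce" phase
            new_centroids[key] = reduce_recenter(vals)
        shift = max(sum((a - b) ** 2 for a, b in zip(c0, c1))
                    for c0, c1 in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:                             # convergence check in the driver
            break
    return centroids

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
print(kmeans(pts, [(0.0, 0.0), (5.0, 5.0)]))
```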

    Scrub: Online TroubleShooting for Large Mission-Critical Applications

    Scrub is a troubleshooting tool for distributed applications that operate under the strict SLOs common in production environments. It allows users to formulate queries on events occurring during execution in order to assess the correctness of the application’s operation. Scrub has been in use for two years at Turn, where developers and users have relied on it to resolve numerous issues in its online advertisement bidding platform. This platform spans thousands of machines across the globe, serving several million bid requests per second and dispensing many millions of dollars in advertising budgets. Troubleshooting distributed applications is notoriously hard, and its difficulty is exacerbated by the presence of strict SLOs, which requires the troubleshooting tool to have only minimal impact on the hosts running the application. Furthermore, with large amounts of money at stake, users expect to be able to run frequent diagnostics and demand quick evaluation and remediation of any problems. These constraints have led to a number of design and implementation decisions that go counter to conventional wisdom. In particular, Scrub supports only a restricted form of joins, and its query execution strategy eschews imposing any overhead on the application hosts: joins, group-by operations, and aggregations are sent to a dedicated centralized facility. In terms of implementation, Scrub avoids the overhead and security concerns of dynamic instrumentation. Finally, at all levels of the system, accuracy is traded for minimal impact on the hosts. We present the design and implementation of Scrub and contrast its choices with those made in earlier systems. We illustrate its power by describing a number of use cases, and we demonstrate its negligible overhead on the underlying application: we observe a maximum CPU overhead of 2.5% on application hosts and a 1% increase in request latency. These overheads allow the advertisement bidding platform to operate well within its SLOs.
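
    As a rough illustration of the division of labour described above (hypothetical code, not Scrub's implementation), the sketch below keeps only cheap local predicate filtering on the application hosts and ships the matching events to a central facility that performs the group-by and aggregation:

```python
# Hypothetical sketch of the split described in the abstract: hosts evaluate only
# cheap local predicates over their own event stream, while joins, group-bys and
# aggregations run at a dedicated central facility.

def host_filter(events, predicate):
    """Runs on each application host: minimal overhead, no joins or group-bys here."""
    return [e for e in events if predicate(e)]

def central_group_count(shipped_events, key_field):
    """Runs at the central facility: group-by key_field and count."""
    counts = {}
    for e in shipped_events:
        key = e.get(key_field)
        counts[key] = counts.get(key, 0) + 1
    return counts

# Example query: count slow bid requests per data centre.
def slow(e):
    return e["latency_ms"] > 80                     # local predicate on each host

host_a = [{"dc": "us-east", "latency_ms": 120}, {"dc": "us-east", "latency_ms": 30}]
host_b = [{"dc": "eu-west", "latency_ms": 95}]

shipped = host_filter(host_a, slow) + host_filter(host_b, slow)
print(central_group_count(shipped, "dc"))           # {'us-east': 1, 'eu-west': 1}
```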

    Data deduplication with edit errors

    In this paper we tackle the problem of file deduplication for efficient data storage. We consider the case where deduplication is performed on files that have been modified by edit errors relative to the original version. We propose a novel block-level deduplication algorithm with variable-length blocks for non-binary alphabets. Compared to hash-based deduplication algorithms, where file deduplication depends on the content of the hash keys, or to brute-force methods that compare files symbol by symbol, our algorithm significantly reduces the number of symbol comparisons and achieves high deduplication ratios. We present a theoretical analysis of the cost of the algorithm compared to naive methods, and experimental results to evaluate the efficiency of our deduplication algorithm.
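
    The paper's algorithm is not reproduced here; as background for why variable-length blocks help under edit errors, the following Python sketch shows standard content-defined chunking (a different, baseline technique, not the proposed method): block boundaries are derived from content rather than byte offsets, so an insertion or deletion only disturbs the blocks around the edit and most other blocks still deduplicate:

```python
# Background illustration only (not the paper's algorithm): content-defined
# chunking, a standard way to get variable-length blocks that survive edits.
import hashlib
import random

def chunk(data, mask=0x1F, window=8):
    """Split bytes into variable-length blocks; a boundary is declared whenever a
    short hash of the last `window` bytes matches `mask` (average block ~32 B)."""
    blocks, start = [], 0
    for i in range(window, len(data)):
        h = int.from_bytes(hashlib.blake2b(data[i - window:i], digest_size=4).digest(), "big")
        if h & mask == mask:
            blocks.append(data[start:i])
            start = i
    blocks.append(data[start:])
    return blocks

def dedup_ratio(old, new):
    """Fraction of the new file's blocks already stored for the old file."""
    stored = {hashlib.sha256(b).hexdigest() for b in chunk(old)}
    new_blocks = chunk(new)
    reused = sum(1 for b in new_blocks if hashlib.sha256(b).hexdigest() in stored)
    return reused / len(new_blocks)

random.seed(0)
original = bytes(random.randrange(256) for _ in range(20000))
edited = original[:5000] + b"INSERTED" + original[5000:]    # a small edit error
print(dedup_ratio(original, edited))    # typically close to 1.0: only blocks near the edit change
```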